Abstract. This work concerns testing the number of parameters in a one hidden
layer multilayer perceptron (MLP). For this purpose we assume that we have
identifiable models, up to a finite group of transformations on the weights; this
is for example the case when the number of hidden units is known. In this
framework, we show that we get a simple asymptotic distribution if we use the
logarithm of the determinant of the empirical error covariance matrix as cost
function.
1 Introduction

Consider a sequence $(Y_t, Z_t)_{t \in \mathbb{N}}$ of i.i.d.¹ random vectors
(i.e. independent and identically distributed). So, each couple $(Y_t, Z_t)$ has
the same law as a generic variable $(Y, Z)$, where $Y$ takes values in
$\mathbb{R}^d$.
1.1 The model

Assume that the model can be written

$$Y_t = F_{W^0}(Z_t) + \varepsilon_t$$

where

• $F_{W^0}$ is a function represented by a one hidden layer MLP with parameters
or weights $W^0$ and sigmoidal functions in the hidden units.

• The noise $(\varepsilon_t)_{t \in \mathbb{N}}$ is a sequence of i.i.d. centered
variables with unknown invertible covariance matrix $\Gamma(W^0)$. Write
$\varepsilon$ for the generic variable with the same law as each $\varepsilon_t$.
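To make the model concrete, here is a minimal simulation sketch. It is only an illustration under stated assumptions: the weight layout (a dictionary with hypothetical keys `A`, `a`, `B`, `b`), the choice of `tanh` as the sigmoidal function, and the Gaussian law used for both the inputs and the noise are not taken from the paper, which only requires centered i.i.d. noise with an invertible covariance matrix.

```python
import numpy as np

def mlp(W, Z):
    """One hidden layer MLP with tanh hidden units (an assumed sigmoid).

    W is a dict with hypothetical keys:
      'A' (h x p) and 'a' (h,): input-to-hidden weights and biases,
      'B' (d x h) and 'b' (d,): hidden-to-output weights and biases.
    Z has shape (n, p); the result has shape (n, d).
    """
    hidden = np.tanh(Z @ W["A"].T + W["a"])
    return hidden @ W["B"].T + W["b"]

def simulate(W0, gamma0, n, rng):
    """Draw (Y_t, Z_t), t = 1..n, from Y_t = F_{W0}(Z_t) + eps_t.

    The noise is centered with covariance gamma0; it is Gaussian here
    only for convenience.
    """
    p = W0["A"].shape[1]
    Z = rng.normal(size=(n, p))
    eps = rng.multivariate_normal(np.zeros(gamma0.shape[0]), gamma0, size=n)
    return mlp(W0, Z) + eps, Z
```

A non-diagonal `gamma0` is the interesting case, since this is where the classical sum-of-squares test discussed below loses its simple asymptotic law.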



Note that a finite number of transformations of the weights leave the MLP
functions invariant (for example, permuting the hidden units); these
transformations form a finite group (see [Sussman, 1992]). To overcome this
problem, we will consider equivalence classes of MLPs: two MLPs are in the same
class if the first one is the image of the second one under such a
transformation. The considered set of parameters is then the quotient space of
parameters by the finite group of transformations.

¹ It is not hard to extend everything we show in this paper to stationary mixing
variables, and so to time series.

In this space, we assume that the model is identifiable; this holds if we
consider only MLPs with the true number of hidden units (see
[Sussman, 1992]). Note that, if the number of hidden units is over-estimated,
then such a test can behave very badly (see [Fukumizu, 2003]). We agree
that the assumption of identifiability is very restrictive, but we want to
emphasize the fact that, even in this framework, the classical test of the
number of parameters in the case of a multidimensional output MLP is not
satisfactory, and we propose to improve it.



1.2 Testing the number of parameters

Let $q$ be an integer less than $s$. We want to test
"$H_0 : W \in \Theta_q \subset \mathbb{R}^q$" against
"$H_1 : W \in \Theta_s \subset \mathbb{R}^s$", where the sets $\Theta_q$ and
$\Theta_s$ are compact. $H_0$ expresses the fact that $W$ belongs to a subset of
$\Theta_s$ with a parametric dimension less than $s$ or, equivalently, that
$s - q$ weights of the MLP in $\Theta_s$ are null. If we consider the classic
cost function $V_n(W) = \sum_{t=1}^{n} \|Y_t - F_W(Z_t)\|^2$, where $\|x\|$
denotes the Euclidean norm of $x$, we get the following test statistic:

$$S_n = n \times \left( \min_{W \in \Theta_q} V_n(W) - \min_{W \in \Theta_s} V_n(W) \right)$$

It is shown in [Yao, 2000] that $S_n$ converges in law to a weighted sum of
$\chi^2_1$ variables

$$\sum_{i=1}^{s-q} \lambda_i \chi^2_{i,1}$$

where the $\chi^2_{i,1}$ are $s - q$ i.i.d. $\chi^2_1$ variables and the
$\lambda_i$ are strictly positive values, different from 1 if the true covariance
matrix of the noise is not the identity. So, in the general case, where the true
covariance matrix of the noise is not the identity, the asymptotic distribution
is not known, because the $\lambda_i$ are not known, and it is difficult to
compute the asymptotic level of the test.
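The practical consequence can be illustrated with a small Monte Carlo sketch. It assumes $s - q = 2$ and hypothetical weights $\lambda_i$ chosen by hand (they are not estimated from any model), and it measures how often a weighted sum of $\chi^2_1$ variables exceeds the 95% quantile of a plain $\chi^2_2$ law.

```python
import numpy as np

def rejection_rate(lambdas, critical_value, n_draws, rng):
    """Monte Carlo estimate of P(sum_i lambda_i * chi2_{i,1} > critical_value)."""
    lambdas = np.asarray(lambdas, dtype=float)
    # One row per draw, one independent chi-square(1) variable per weight.
    chi2_draws = rng.chisquare(df=1, size=(n_draws, lambdas.size))
    weighted_sums = chi2_draws @ lambdas
    return float(np.mean(weighted_sums > critical_value))

rng = np.random.default_rng(0)
# For 2 degrees of freedom, P(chi2 > x) = exp(-x / 2), so the 95% quantile
# is available in closed form.
crit = -2.0 * np.log(0.05)
rate_identity = rejection_rate([1.0, 1.0], crit, 200_000, rng)  # all lambda_i = 1
rate_scaled = rejection_rate([3.0, 3.0], crit, 200_000, rng)    # all lambda_i = 3
```

With all $\lambda_i = 1$ the nominal 5% level is recovered, whereas with $\lambda_i = 3$ the true rejection probability is $0.05^{1/3}$, roughly 37%, so a test calibrated on the $\chi^2_2$ quantile would be badly miscalibrated.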

To overcome this difficulty we propose to use instead the cost function

$$U_n(W) = \ln\det\left( \frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))(Y_t - F_W(Z_t))^T \right). \qquad (1)$$

We will show that, under suitable assumptions, the test statistic

$$T_n = n \times \left( \min_{W \in \Theta_q} U_n(W) - \min_{W \in \Theta_s} U_n(W) \right) \qquad (2)$$

converges to a classical $\chi^2_{s-q}$, so the asymptotic level of the test is
very easy to compute. The rest of this paper is devoted to the proof of
this property.
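As an illustration of equations (1) and (2), the following sketch computes the cost and the test statistic from residual matrices. It assumes the two minimizations have already been carried out by some optimizer, so that only the residuals $Y_t - F_W(Z_t)$ of the two fitted models are needed; the function names are hypothetical.

```python
import numpy as np

def u_n(residuals):
    """Equation (1): ln det of the empirical error covariance matrix.

    residuals: array of shape (n, d) whose rows are Y_t - F_W(Z_t).
    """
    n = residuals.shape[0]
    gamma_n = residuals.T @ residuals / n
    sign, logdet = np.linalg.slogdet(gamma_n)
    if sign <= 0:
        raise ValueError("empirical covariance is not positive definite")
    return float(logdet)

def t_n(residuals_q, residuals_s):
    """Equation (2): n times the gap between the two minimized costs.

    residuals_q: residuals of the fitted model under H0 (q parameters).
    residuals_s: residuals of the fitted larger model (s parameters).
    """
    n = residuals_s.shape[0]
    return n * (u_n(residuals_q) - u_n(residuals_s))
```

Using `slogdet` rather than `log(det(...))` avoids overflow or underflow of the determinant when the output dimension is large.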
2 Asymptotic properties of $T_n$

In order to investigate the asymptotic properties of the test we have to prove
the consistency and the asymptotic normality of
$\hat{W}_n = \arg\min_{W \in \Theta_s} U_n(W)$.
Assume, in the sequel, that $\varepsilon$ has a moment of order at least 2 and
write

$$\Gamma_n(W) = \frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))(Y_t - F_W(Z_t))^T .$$

Remark that the matrix $\Gamma_n(W)$ and its inverse are symmetric. In the same
way, we write $\Gamma(W) = \lim_{n \to \infty} \Gamma_n(W)$, which is well
defined because of the moment condition on $\varepsilon$.

2.1 Consistency of $\hat{W}_n$

First we have to identify the contrast function associated with $U_n(W)$.

Lemma 1

$$U_n(W) - U_n(W^0) \xrightarrow[n \to \infty]{a.s.} K(W, W^0)$$

with $K(W, W^0) \geq 0$ and $K(W, W^0) = 0$ if and only if $W = W^0$.

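Proof: By the strong law of large numbers we have

$$U_n(W) - U_n(W^0) \xrightarrow[n \to \infty]{a.s.} \ln\det(\Gamma(W)) - \ln\det(\Gamma(W^0)) = \ln\frac{\det(\Gamma(W))}{\det(\Gamma(W^0))} = \ln\det\left( \Gamma^{-1}(W^0)\left( \Gamma(W) - \Gamma(W^0) \right) + I_d \right)$$

where $I_d$ denotes the identity matrix of $\mathbb{R}^d$. So, the lemma is true
if $\Gamma(W) - \Gamma(W^0)$ is a positive matrix, null only if $W = W^0$. But
this property is true since

$$\begin{aligned}
\Gamma(W) &= E\left( (Y - F_W(Z))(Y - F_W(Z))^T \right) \\
&= E\left( (Y - F_{W^0}(Z) + F_{W^0}(Z) - F_W(Z))(Y - F_{W^0}(Z) + F_{W^0}(Z) - F_W(Z))^T \right) \\
&= E\left( (Y - F_{W^0}(Z))(Y - F_{W^0}(Z))^T \right) + E\left( (F_{W^0}(Z) - F_W(Z))(F_{W^0}(Z) - F_W(Z))^T \right) \\
&= \Gamma(W^0) + E\left( (F_{W^0}(Z) - F_W(Z))(F_{W^0}(Z) - F_W(Z))^T \right)
\end{aligned}$$

where the cross terms vanish when the centered noise
$\varepsilon = Y - F_{W^0}(Z)$ is independent of $Z$, as is implicit in the
model. ■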
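As a quick sanity check of the lemma, the scalar case $d = 1$ can be worked out by hand. This is an illustrative special case only; $\sigma^2$ and $\delta(W)$ are shorthand introduced here, not notation from the paper.

```latex
\begin{aligned}
\Gamma(W^0) &= \sigma^2, \qquad
\Gamma(W) = \sigma^2 + \delta(W), \qquad
\delta(W) := E\left( (F_{W^0}(Z) - F_W(Z))^2 \right) \geq 0, \\
K(W, W^0) &= \ln\left( \frac{\delta(W)}{\sigma^2} + 1 \right) \geq 0 .
\end{aligned}
```

Equality holds only when $\delta(W) = 0$, that is when $F_W(Z) = F_{W^0}(Z)$ almost surely, which under the identifiability assumption means $W = W^0$.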

We then deduce the consistency theorem:

Theorem 1 If $E(\|\varepsilon\|^2) < \infty$, then $\hat{W}_n \xrightarrow{P} W^0$.

Proof: Remark that there exists a constant $B$ such that

$$\sup_{W \in \Theta_s} \|Y - F_W(Z)\|^2 < \|Y\|^2 + B$$

because $\Theta_s$ is compact, so $F_W(Z)$ is bounded. For a matrix
$A \in \mathbb{R}^{d \times d}$, let $\|A\|$ be a norm, for example
$\|A\|^2 = \mathrm{tr}(AA^T)$. We have

$$\liminf_{W \in \Theta_s} \|\Gamma_n(W)\| = \|\Gamma(W^0)\| > 0$$

$$\limsup_{W \in \Theta_s} \|\Gamma_n(W)\| := C < \infty$$

and since the function

$$\Gamma \mapsto \ln\det\Gamma, \quad \text{for } C \geq \|\Gamma\| \geq \|\Gamma(W^0)\|,$$

is uniformly continuous, by the same argument as in example 19.8 of
[Van der Vaart, 1998] the set of functions $U_n(W)$, $W \in \Theta_s$, is
Glivenko-Cantelli.

Finally, theorem 5.7 of [Van der Vaart, 1998] shows that $\hat{W}_n$ converges
in probability to $W^0$. ■



2.2 Asymptotic normality 

For this purpose we have to compute the first and second derivatives of
$U_n(W)$ with respect to the parameters. First, we introduce a notation: if
$F_W(X)$ is a $d$-dimensional parametric function depending on a parameter $W$,
write $\frac{\partial F_W(X)}{\partial W_k}$ (resp.
$\frac{\partial^2 F_W(X)}{\partial W_k \partial W_l}$) for the $d$-dimensional
vector of partial derivatives (resp. second order partial derivatives) of each
component of $F_W(X)$.

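First derivatives: if $\Gamma_n(W)$ is a matrix depending on the parameter
vector $W$, we get from [Magnus and Neudecker, 1988]

$$\frac{\partial}{\partial W_k} \ln\det(\Gamma_n(W)) = \mathrm{tr}\left( \Gamma_n^{-1}(W) \frac{\partial}{\partial W_k} \Gamma_n(W) \right)$$

Hence, if we write

$$A_n(W_k) = \frac{1}{n} \sum_{t=1}^{n} \left( -\frac{\partial F_W(z_t)}{\partial W_k} (y_t - F_W(z_t))^T \right)$$

so that $\frac{\partial}{\partial W_k} \Gamma_n(W) = A_n(W_k) + A_n^T(W_k)$,
using the fact

$$\mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \right) = \mathrm{tr}\left( A_n^T(W_k) \Gamma_n^{-1}(W) \right) = \mathrm{tr}\left( \Gamma_n^{-1}(W) A_n^T(W_k) \right)$$

we get

$$\frac{\partial}{\partial W_k} \ln\det(\Gamma_n(W)) = 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \right) \qquad (3)$$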
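Equation (3) can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a deliberately tiny MLP (scalar input, two `tanh` hidden units, two outputs, no biases) whose weight layout is hypothetical, and it compares the analytic gradient with central finite differences.

```python
import numpy as np

def forward(w, z):
    """Tiny MLP: w = (a_0, a_1, B_00, B_01, B_10, B_11), z of shape (n,).
    Returns F_W(z_t) for every t, as an array of shape (n, 2)."""
    a, B = w[:2], w[2:].reshape(2, 2)
    return np.tanh(np.outer(z, a)) @ B.T

def jacobian(w, z):
    """dF_W(z_t)/dW_k for every weight k: array of shape (6, n, 2)."""
    a, B = w[:2], w[2:].reshape(2, 2)
    h = np.tanh(np.outer(z, a))                      # hidden activations (n, 2)
    jac = np.zeros((6, z.size, 2))
    for j in range(2):                               # input-to-hidden weight a_j
        jac[j] = ((1.0 - h[:, j] ** 2) * z)[:, None] * B[:, j][None, :]
    for i in range(2):                               # hidden-to-output weight B_ij
        for j in range(2):
            jac[2 + 2 * i + j][:, i] = h[:, j]
    return jac

def log_det_cost(w, y, z):
    """U_n(W): ln det of the empirical residual covariance."""
    r = y - forward(w, z)
    return np.linalg.slogdet(r.T @ r / len(z))[1]

def grad_log_det_cost(w, y, z):
    """Gradient from equation (3): 2 tr(Gamma_n^{-1}(W) A_n(W_k)) for each k."""
    r = y - forward(w, z)
    n = len(z)
    gamma_inv = np.linalg.inv(r.T @ r / n)
    grads = []
    for dF in jacobian(w, z):
        A = -(dF.T @ r) / n                          # A_n(W_k) = (1/n) sum_t -dF_t r_t^T
        grads.append(2.0 * np.trace(gamma_inv @ A))
    return np.array(grads)

def finite_difference_gradient(w, y, z, eps=1e-6):
    """Central finite differences of U_n, used only as a reference."""
    return np.array([
        (log_det_cost(w + eps * e, y, z) - log_det_cost(w - eps * e, y, z)) / (2 * eps)
        for e in np.eye(w.size)
    ])
```

On simulated data the two gradients should agree up to finite-difference error, which is a cheap way to validate a hand-written derivative before handing it to an optimizer.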
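Second derivatives: We write now

$$B_n(W_k, W_l) := \frac{1}{n} \sum_{t=1}^{n} \left( \frac{\partial F_W(z_t)}{\partial W_k} \frac{\partial F_W(z_t)}{\partial W_l}^T \right)$$

and

$$C_n(W_k, W_l) := \frac{1}{n} \sum_{t=1}^{n} \left( -(y_t - F_W(z_t)) \frac{\partial^2 F_W(z_t)}{\partial W_k \partial W_l}^T \right)$$

We get

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = \frac{\partial}{\partial W_l} 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \right) = 2\,\mathrm{tr}\left( \frac{\partial \Gamma_n^{-1}(W)}{\partial W_l} A_n(W_k) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) B_n(W_k, W_l) \right) + 2\,\mathrm{tr}\left( \Gamma_n(W)^{-1} C_n(W_k, W_l) \right)$$

Now, [Magnus and Neudecker, 1988] give an analytic form of the derivative
of an inverse matrix, so we get

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) \left( A_n(W_k) + A_n^T(W_k) \right) \Gamma_n^{-1}(W) A_n(W_l) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) B_n(W_k, W_l) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) C_n(W_k, W_l) \right)$$

so

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = 4\,\mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \Gamma_n^{-1}(W) A_n(W_l) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) B_n(W_k, W_l) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) C_n(W_k, W_l) \right)$$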

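Asymptotic distribution of $\hat{W}_n$: The previous equations allow us to give
the asymptotic properties of the estimator minimizing the cost function
$U_n(W)$; namely, from equation (3) and the expression of the second derivatives
we can compute the asymptotic properties of the first and second derivatives of
$U_n(W)$. If the variable $Z$ has a moment of order at least 3 then we get the
following theorem:

Theorem 2 Assume that $E(\|\varepsilon\|^2) < \infty$ and
$E(\|Z\|^3) < \infty$, let $\Delta U_n(W^0)$ be the gradient vector of $U_n(W)$
at $W^0$ and $HU_n(W^0)$ be the Hessian matrix of $U_n(W)$ at $W^0$.
Write finally

$$B(W_k, W_l) := \frac{\partial F_W(Z)}{\partial W_k} \frac{\partial F_W(Z)}{\partial W_l}^T$$

We get then

1. $HU_n(W^0) \xrightarrow{a.s.} 2 I_0$

2. $\sqrt{n}\, \Delta U_n(W^0) \xrightarrow{Law} \mathcal{N}(0, 4 I_0)$

3. $\sqrt{n}\, (\hat{W}_n - W^0) \xrightarrow{Law} \mathcal{N}(0, I_0^{-1})$

where the component $(k, l)$ of the matrix $I_0$ is

$$\mathrm{tr}\left( \Gamma_0^{-1} E\left( B(W_k^0, W_l^0) \right) \right)$$

with $\Gamma_0 := \Gamma(W^0)$.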
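Proof: We can show easily that, for all $x \in \mathbb{R}^d$, we have:

$$\left\| \frac{\partial F_W(Z)}{\partial W_k} \right\| \leq Cte\,(1 + \|Z\|)$$

$$\left\| \frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} \right\| \leq Cte\,(1 + \|Z\|^2)$$

$$\left\| \frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} - \frac{\partial^2 F_{W^0}(Z)}{\partial W_k \partial W_l} \right\| \leq Cte\, \|W - W^0\| (1 + \|Z\|^3)$$

Write

$$A(W_k) = \left( -\frac{\partial F_W(Z)}{\partial W_k} (Y - F_W(Z))^T \right)$$

and $U(W) := \log\det(Y - F_W(Z))$.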

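Note that the component $(k, l)$ of the matrix $4 I_0$ is:

$$E\left( \frac{\partial U(W^0)}{\partial W_k} \frac{\partial U(W^0)}{\partial W_l} \right) = E\left( 2\,\mathrm{tr}\left( \Gamma_0^{-1} A^T(W_k^0) \right) \times 2\,\mathrm{tr}\left( \Gamma_0^{-1} A(W_l^0) \right) \right)$$

and, since the trace of the product is invariant by circular permutation,

$$\begin{aligned}
E\left( \frac{\partial U(W^0)}{\partial W_k} \frac{\partial U(W^0)}{\partial W_l} \right)
&= 4 E\left( -\frac{\partial F_{W^0}(Z)}{\partial W_k}^T \Gamma_0^{-1} (Y - F_{W^0}(Z))(Y - F_{W^0}(Z))^T \Gamma_0^{-1} \left( -\frac{\partial F_{W^0}(Z)}{\partial W_l} \right) \right) \\
&= 4 E\left( \frac{\partial F_{W^0}(Z)}{\partial W_k}^T \Gamma_0^{-1} \frac{\partial F_{W^0}(Z)}{\partial W_l} \right) \\
&= 4\,\mathrm{tr}\left( \Gamma_0^{-1} E\left( \frac{\partial F_{W^0}(Z)}{\partial W_k} \frac{\partial F_{W^0}(Z)}{\partial W_l}^T \right) \right) \\
&= 4\,\mathrm{tr}\left( \Gamma_0^{-1} E\left( B(W_k^0, W_l^0) \right) \right)
\end{aligned}$$

Now, the derivative $\frac{\partial U(W)}{\partial W_k}$ is square integrable,
so $\Delta U_n(W^0)$ fulfills Lindeberg's condition (see
[Hall and Heyde, 1980]) and

$$\sqrt{n}\, \Delta U_n(W^0) \xrightarrow{Law} \mathcal{N}(0, 4 I_0)$$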

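For the component $(k, l)$ of the expectation of the Hessian matrix, remark
first that

$$\lim_{n \to \infty} \mathrm{tr}\left( \Gamma_n^{-1}(W^0) A_n(W_k^0) \Gamma_n^{-1}(W^0) A_n(W_l^0) \right) = 0$$

and

$$\lim_{n \to \infty} \mathrm{tr}\left( \Gamma_n^{-1} C_n(W_k^0, W_l^0) \right) = 0$$

so

$$\begin{aligned}
\lim_{n \to \infty} H_n(W^0) &= \lim_{n \to \infty} 4\,\mathrm{tr}\left( \Gamma_n^{-1}(W^0) A_n(W_k^0) \Gamma_n^{-1}(W^0) A_n(W_l^0) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W^0) B_n(W_k^0, W_l^0) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1} C_n(W_k^0, W_l^0) \right) \\
&= 2\,\mathrm{tr}\left( \Gamma_0^{-1} E\left( B(W_k^0, W_l^0) \right) \right)
\end{aligned}$$

Now, since
$\left\| \frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} \right\| \leq Cte\,(1 + \|Z\|^2)$
and
$\left\| \frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} - \frac{\partial^2 F_{W^0}(Z)}{\partial W_k \partial W_l} \right\| \leq Cte\, \|W - W^0\| (1 + \|Z\|^3)$,
by standard arguments found, for example, in [Yao, 2000] we get

$$\sqrt{n}\, (\hat{W}_n - W^0) \xrightarrow{Law} \mathcal{N}(0, I_0^{-1})$$ ■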
2.3 Asymptotic distribution of $T_n$

In this section, we write $\hat{W}_n = \arg\min_{W \in \Theta_s} U_n(W)$ and
$\hat{W}_n^0 = \arg\min_{W \in \Theta_q} U_n(W)$, where $\Theta_q$ is viewed as
a subset of $\mathbb{R}^s$. The asymptotic distribution of $T_n$ is then a
consequence of the previous section: if we replace $n U_n(W)$ by its Taylor
expansion around $\hat{W}_n$ and $\hat{W}_n^0$, following
[Van der Vaart, 1998], chapter 16, we have:

$$T_n = \sqrt{n}\, (\hat{W}_n - \hat{W}_n^0)^T\, I_0\, \sqrt{n}\, (\hat{W}_n - \hat{W}_n^0) + o_P(1) \xrightarrow{Law} \chi^2_{s-q}$$

3 Conclusion 

It has been shown that, in the case of multidimensional output, the cost
function $U_n(W)$ leads to a test for the number of parameters in an MLP that is
simpler than with the traditional mean square cost function. In fact the
estimator $\hat{W}_n$ is also more efficient than the least squares estimator
(see [Rynkiewicz, 2003]). We can also remark that $U_n(W)$ matches twice the
"concentrated Gaussian log-likelihood", but we have to emphasize that its nice
asymptotic properties need only moment conditions on $\varepsilon$ and $Z$, so
it works even if the distribution of the noise is not Gaussian. Another solution
could be to use an approximation of the error covariance matrix to compute the
generalized least squares estimator:

$$\frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))^T\, \Gamma^{-1}\, (Y_t - F_W(Z_t)),$$

assuming that $\Gamma$ is a good approximation of the true covariance matrix of
the noise $\Gamma(W^0)$. However, it takes time to compute a good matrix
$\Gamma$, and if we try to compute the best matrix $\Gamma$ with the data, it
leads to the cost function $U_n(W)$ (see for example [Gallant, 1987]).
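The link between the generalized least squares criterion and $U_n(W)$ can be sketched in a few lines. The criterion $L_n(W, \Gamma)$ below is notation introduced here for illustration: it is the usual Gaussian criterion (twice the negative log-likelihood per observation, up to an additive constant), and $\Gamma_n(W)$ is the empirical error covariance matrix defined in section 2.

```latex
\begin{aligned}
L_n(W, \Gamma) &:= \ln\det\Gamma
  + \frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))^T\, \Gamma^{-1}\, (Y_t - F_W(Z_t))
  = \ln\det\Gamma + \mathrm{tr}\left( \Gamma^{-1} \Gamma_n(W) \right), \\
\frac{\partial L_n}{\partial \Gamma^{-1}} &= -\Gamma + \Gamma_n(W) = 0
  \iff \Gamma = \Gamma_n(W), \\
\min_{\Gamma} L_n(W, \Gamma) &= \ln\det\Gamma_n(W) + d = U_n(W) + d .
\end{aligned}
```

So, for each $W$, choosing the best $\Gamma$ from the data and plugging it back gives $U_n(W)$ up to the constant $d$, which does not affect the minimization over $W$.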

Finally, as we see in this paper, the computation of the derivatives of
$U_n(W)$ is easy, so we can use efficient differential optimization techniques
to estimate $\hat{W}_n$; numerical examples can be found in [Rynkiewicz, 2003].